Search MGnify Genomes

Python
Author: Virginie Grosboillot
Affiliation: ETH Zurich

This is a static preview. You can run and edit these examples interactively on Galaxy.

How to use the “genome search” for MGnify MAGs

This notebook aims to give a non-exhaustive overview of how to use the genome search API for MGnify MAGs.

This notebook is divided into 5 sections:

1. Libraries needed to run the full notebook
2. How to query the genomes database from the MGnify API, save and reload the results
3. Explore the whole genome dataset
4. Graphical representations
5. Case study: find out whether your own MAGs are novel compared to the MGnify catalogues

This is an interactive code notebook (a Jupyter Notebook). To run this code, click into each cell and press the ▶ button in the top toolbar, or press shift+enter.

Import libraries

Import the Python libraries required to run the examples presented in this notebook.

# Connection to MGnify API
from jsonapi_client import Session as APISession
from jsonapi_client import Modifier
import requests

# Dataframes and display
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

# Data transformation
from functools import reduce

# Plots
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
%matplotlib inline 

# Create signature of MAGs for comparison against database
import sourmash
import glob
import time
from pathlib import PurePath as pp
from Bio import SeqIO

# Warning verbosity
import warnings 
warnings.filterwarnings(action="ignore")

Query the genomes database from the MGnify API

For the genome dataset, use the genomes endpoint.
A complete list of endpoints can be found at https://www.ebi.ac.uk/metagenomics/api/v1/.

endpoint_name = 'genomes'

The Modifier function from the jsonapi_client module allows us to query for specific values in given fields (e.g. ‘geographic-origin’, ‘taxon-lineage’).
One way to explore the available fields is to use requests to fetch the API endpoint:

r = requests.get(f"https://www.ebi.ac.uk/metagenomics/api/v1/{endpoint_name}")
r.json()['data'][0]
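The returned JSON stores each genome’s fields in an attributes section; listing the keys of the first record shows what can be displayed and filtered on. A minimal sketch, reusing the r response fetched above:

# List the attribute fields of the first genome record
print(sorted(r.json()['data'][0]['attributes'].keys()))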

Get information for a specific genus or species

Examples: search for available resources for a specific genus or species of interest:
- Listeria
- Listeria monocytogenes

The taxon-lineage field contains domain, phylum, class, order, family, genus, species, and subspecies, e.g. d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Listeriaceae;g__Listeria;s__Listeria monocytogenes for Listeria monocytogenes.
The filter can use the full lineage d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Listeriaceae;g__Listeria, or only part of it, e.g. g__Listeria or Listeria.
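As a sketch (full_lineage, full_filter and full_df are illustrative names), a query using the full lineage string should return the same entries as the shorter ‘Listeria’ filter used below:

full_lineage = 'd__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Listeriaceae;g__Listeria'
with APISession("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    full_filter = Modifier(f"taxon_lineage={full_lineage}")
    full_df = pd.json_normalize(map(lambda r: r.json, mgnify.iterate(endpoint_name, filter=full_filter)))
full_df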

Set the desired filter(s)

genus_filter = 'Listeria'
species_filter = 'Listeria monocytogenes'

Query the database with the ‘Listeria’ filter and store the results in a Pandas DataFrame.

with APISession("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    search_filter = Modifier(f"taxon_lineage={genus_filter}")
    resources = map(lambda r: r.json, mgnify.iterate(endpoint_name, filter=search_filter))
    resources_df = pd.json_normalize(resources)
# Display the table containing the results of the query
resources_df

Output: Four entries are obtained for the Listeria genus.

Query the database with the ‘Listeria monocytogenes’ filter and store the results in a Pandas DataFrame.

with APISession("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    search_filter_2 = Modifier(f"taxon_lineage={species_filter}")
    resources_2 = map(lambda r: r.json, mgnify.iterate(endpoint_name, filter=search_filter_2))
    resources_df_2 = pd.json_normalize(resources_2)
# Display the table containing the results of the query
resources_df_2

Output: Two entries are obtained for the Listeria monocytogenes species.

Save the results as parquet

Pandas allows you to save your query results in multiple formats (see the Pandas documentation for more options).

Parquet files can be opened with several libraries (pandas, pyspark, polars, …) and have the advantage of being a compressed data format in which column types are preserved.

resources_df.to_parquet('Listeria_resources.parquet')
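As a sketch of an alternative format, the same DataFrame can also be written to CSV; unlike parquet, CSV does not preserve column types:

# CSV alternative: human-readable, but dtypes must be re-inferred on load
resources_df.to_csv('Listeria_resources.csv', index=False)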

Load previously saved parquet files

listeria_df = pd.read_parquet('Listeria_resources.parquet')
listeria_df

Output: Display the saved data as a pandas DataFrame.

Explore the whole genome dataset

Query and save the dataset as a parquet file

To query the whole dataset, we can use the same method as previously. The only difference is that no filter is passed to the query.

Warning: querying without a filter is computationally expensive and will take time.

A pre-fetched copy of the data (as of 8 November 2022) is available in ../example-data/genomes/all_genome_resources.parquet.

This was fetched using the following code, which you could copy-and-paste into a code cell to re-fetch the latest data:

with APISession("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    resources_all = map(lambda r: r.json, mgnify.iterate(endpoint_name))
resources_all_df = pd.json_normalize(resources_all)
resources_all_df
resources_all_df.to_parquet('latest_genome_resources.parquet')

Load the dataset file with another Python library

Here is an example to show that saved parquet files can be loaded with different Python libraries (some being more suitable for loading and transforming large amounts of data). Of course, this file can also be loaded with pandas (see section 1.3.3 of this notebook).

For this example, we use PySpark. For help, see the documentation: https://spark.apache.org/docs/latest/api/python/.

PySpark requires a SparkSession to be started before use. A basic example is shown in the section below.

Start Spark Session

from pyspark.sql import SparkSession

import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()

Load the data from the parquet file

all_genomes_df = spark.read.parquet('../example-data/genomes/all_genome_resources.parquet')

Get the shape of the DataFrame

# To get the number of rows
all_genomes_df.count()
# To get the number of columns
len(all_genomes_df.columns)

Outputs: The dataframe has 9421 rows and 37 columns.

Get stats when applicable

PySpark can show statistics of the data in the parquet dataframe:

all_genomes_df.describe().show(truncate=False, vertical=True)

Outputs: for example:
- the first record gives the number of non-null values for each column.
- the gc-content over the whole dataset is 47.95 ± 10.36%, with a minimum of 22.94% and a maximum of 74.84% (if you are using the example dataset).

Note: when using Spark, column names that contain special characters (such as the . in attributes.gc-content) must be escaped with backticks.

# To visualise only gc-content:
all_genomes_df.select('`attributes.gc-content`').describe().show()

Get the most represented genus in the dataset

# To see a sample of taxon-lineages present in the dataset:
all_genomes_df.select(f'`attributes.taxon-lineage`').show(truncate=False)
# The total number of genomes in the dataset:
all_genomes_df.select('`id`').distinct().count()
# The number of distinct lineages:
all_genomes_df.select('`attributes.taxon-lineage`').distinct().count()

Split the taxon-lineage column into 7 columns

features = ['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']
all_genomes_tax_df = reduce(lambda df, i: df.withColumn(features[i], F.col('lineage_split')[i]),
    range(len(features)),
    all_genomes_df.withColumn('lineage_split', F.split(F.col('`attributes.taxon-lineage`'), ';')),
)
all_genomes_tax_df.select(features).show(n=5)

Query examples on taxon-lineage column

# To search for the most represented taxa (here, lineages with more than 100 genomes):
all_genomes_tax_df.groupby('`attributes.taxon-lineage`').count().filter(F.col('count')>100).show(truncate=False)

Output: the Collinsella genus, with no species annotated, is the most represented taxon in the dataset (if you are using the example dataset).

# To search for a particular lineage and count how many times it appears:
all_genomes_tax_df.filter(F.col('`attributes.taxon-lineage`').startswith('d__Bacteria;p__Actinobacteriota;c__Coriobacteriia;o__Coriobacteriales;f__Coriobacteriaceae;g__Collinsella')).count()

# To search for a particular genus and count how many times it appears:
all_genomes_tax_df.filter(F.col('`attributes.taxon-lineage`').contains('Collinsella')).count()

Query examples on newly created columns:

These columns make it possible to search for the most or least represented genus, species, etc. in the dataset; the search is more flexible than filtering on the full taxon string.

all_genomes_tax_df.groupby('genus').count().filter(F.col('count')>100).show()

Output: As previously, the genus Collinsella is the most represented, with 584 entries (if you are using the example dataset).
By splitting the taxon into several columns, we can see that the genera Prevotella and RC9 are also among the most represented in this dataset, and that 325 of the taxa in the database have no genus annotated. These taxa did not appear in the query on the taxon-lineage column because their genomes are spread across proportionally more distinct species than those of the Collinsella genus, so no single full lineage passes the count threshold.

all_genomes_tax_df.filter(F.col('genus').isin('g__Prevotella', 'g__RC9', 'g__Collinsella')).groupby('genus').agg(F.countDistinct('species')).show()
# To see some of the Collinsella species in the dataset:
all_genomes_tax_df.filter(F.col('genus')=='g__Collinsella').select('species').distinct().show(truncate=False)

Graphical representation

This section contains examples of graphical representations that could be obtained from this dataset.

all_genomes_tax_df.count()

Representation of the bacterial taxon-lineage

This graphic gives an idea of the proportional representation of each taxon in the dataset.

List of the features used for the graphical representation (see section 1.4.4.1, where the features variable is defined):

features 
all_genomes_tax_df.select([F.count_distinct(f).alias(f'{f}_count') for f in features]).show()

Output: count of the distinct values for each category.

Create a Sankey diagram to visually portray this. Note: colours and node sizes can be changed in the code below.

def get_sankey(df, cat_cols=[], value_cols='', title='Sankey Diagram'):
    # Colors
    colorPalette = ['rgba(31, 119, 180, 0.8)',
     'rgba(255, 127, 14, 0.8)',
     'rgba(44, 160, 44, 0.8)',
     'rgba(214, 39, 40, 0.8)',
     'rgba(148, 103, 189, 0.8)',
     'rgba(140, 86, 75, 0.8)',
     'rgba(227, 119, 194, 0.8)',
     'rgba(127, 127, 127, 0.8)']
    labelList = []
    colorNumList = []
    for catCol in cat_cols:
        labelListTemp =  list(set(df[catCol].values))
        colorNumList.append(len(labelListTemp))
        labelList = labelList + labelListTemp
 
    # remove duplicates from labelList
    labelList = list(dict.fromkeys(labelList))
 
    # define colors based on number of levels
    colorList = []
    for idx, colorNum in enumerate(colorNumList):
        colorList = colorList + [colorPalette[idx]]*colorNum

    # transform df into a source-target pair
    for i in range(len(cat_cols)-1):
        if i==0:
            sourceTargetDf = df[[cat_cols[i],cat_cols[i+1],value_cols]]
            sourceTargetDf.columns = ['source','target','count']
        else:
            tempDf = df[[cat_cols[i],cat_cols[i+1],value_cols]]
            tempDf.columns = ['source','target','count']
            sourceTargetDf = pd.concat([sourceTargetDf,tempDf])
        sourceTargetDf = sourceTargetDf.groupby(['source','target']).agg({'count':'sum'}).reset_index()
 
    # add index for source-target pair
    sourceTargetDf['sourceID'] = sourceTargetDf['source'].apply(lambda x: labelList.index(x))
    sourceTargetDf['targetID'] = sourceTargetDf['target'].apply(lambda x: labelList.index(x))
 
    # creating data for the sankey diagram
    data = dict(
        type='sankey',
        node = dict(
            pad = 15,
            thickness = 20,
            line = dict(
                color = "black",
                width = 0.5
            ),
            label = labelList,
            color = colorList
        ),
        link = dict(
            source = sourceTargetDf['sourceID'],
            target = sourceTargetDf['targetID'],
            value = sourceTargetDf['count']
        )
    )
    
    # override gray link colors with 'source' colors
    opacity = 0.4
    # change 'magenta' to its 'rgba' value to add opacity
    data['node']['color'] = ['rgba(255,0,255, 0.8)' if color == "magenta" else color for color in data['node']['color']]
    data['link']['color'] = [data['node']['color'][src].replace("0.8", str(opacity))
                                        for src in data['link']['source']]
    
    
    fig = go.Figure(data=[go.Sankey(
    # Define nodes
    node = dict(
      pad = 15,
      thickness = 15,
      line = dict(color = "black", width = 0.5),
      label =  data['node']['label'],
      color =  data['node']['color']
    ),
    # Add links
    link = dict(
      source =  data['link']['source'],
      target =  data['link']['target'],
      value =  data['link']['value'],
      color =  data['link']['color']
    ))])
    
    fig.update_layout(title_text=title, font_size=10)
    
    return fig.show(renderer='iframe')

Taxon-lineage represented in Genomes dataset

# Convert Spark DataFrame to Pandas DataFrame:
pdf = all_genomes_tax_df.select(features).groupby(features).count().toPandas()
# Title and number of displayed features can be modified  
get_sankey(pdf, cat_cols=features[0:4], value_cols='count', title='Taxon-lineage represented in Genomes dataset')

Representation of a sample of the genomes present in the dataset: an example from the order Lactobacillales.

pdf_lactobacillales = all_genomes_tax_df.filter(F.col('order')=='o__Lactobacillales').select(features).groupby(features).count().toPandas()
fig_l = get_sankey(pdf_lactobacillales,cat_cols=features[0:6], value_cols='count',title='Genomes from the Lactobacillales order')
# Note that there are too many distinct species in the Lactobacillales order to show individually:
all_genomes_tax_df.filter(F.col('order')=='o__Lactobacillales').select('species').distinct().count()

Information such as genome length or GC-content can also be represented

We can group and visualise these at different levels like family, genus, species… depending on the number of sequences available and on the biological significance.

lactobacillales_df = all_genomes_tax_df.filter(F.col('order')=='o__Lactobacillales').orderBy('family').toPandas()
lactobacillales_count = all_genomes_tax_df.filter(F.col('order')=='o__Lactobacillales').groupby('family').count().orderBy('family').toPandas()
lactobacillales_count
fig = plt.figure(figsize=(10, 10), layout="constrained")
spec = fig.add_gridspec(3, 1)

ax00 = fig.add_subplot(spec[0, 0])
sns.barplot(data=lactobacillales_count, x='family', y='count')
plt.ylabel("Number of genome available")

ax10 = fig.add_subplot(spec[1, 0])
sns.boxplot(data=lactobacillales_df, x='family', y='attributes.length')
plt.ylabel("Genome length (bp)")
#plt.xlabel("Family of the Lactobacillales order")

ax20 = fig.add_subplot(spec[2, 0])
sns.boxplot(data=lactobacillales_df, x='family', y='attributes.gc-content')
plt.ylabel("GC-content (%)")
plt.xlabel("Family of the Lactobacillales order")


fig.suptitle('Number of genomes available, genome length and GC-content of bacteria belonging to the Lactobacillales order');
fig = plt.figure(figsize=(20, 5))
spec = fig.add_gridspec(1, 2)

#ax00 = fig.add_subplot(spec[0, 0])
#lactobacillales_df['relationships.biome.data.id'].hist()
#plt.xlabel("Biome")

ax01 = fig.add_subplot(spec[0:])
lactobacillales_df['relationships.catalogue.data.id'].hist()
plt.xlabel("Catalogue")
ax01.grid(False)

fig.suptitle('Biome and Catalogue related to bacteria belonging to the Lactobacillales order');

Another example: produce a quality control figure similar to Extended Data Fig. 4a of Almeida et al. 2020

qc_df = all_genomes_tax_df.toPandas()

Statistical information can be obtained with the describe function… (see section 1.4.3)

qc_df[['attributes.completeness', 'attributes.contamination']].describe()

… and visualised in plots:

fig = plt.figure(figsize=(5, 10), layout="constrained")
spec = fig.add_gridspec(1, 1)

ax00 = fig.add_subplot(spec[0, 0])
sns.boxplot(data=qc_df[['attributes.completeness', 'attributes.contamination']])
plt.ylabel("%")


fig.suptitle('Quality of genomes available');

Find out whether your own MAGs are novel compared to the MGnify catalogues

Another use for the MGnify genomes resource is to query your own MAGs against MGnify’s MAG catalogues, to see whether they are novel or already represented.

List the paths of the files to be analysed:

Replace the string with your own path to the folder containing your files. The * wildcard matches all files with the .fa extension.

files = glob.glob('../example-data/genomes/*.fa')
files

Compute a sourmash sketch for each MAG

Create “sketches” for each MAG using Sourmash

A sketch is stored in a signature, which we will use for searching. The signature is essentially a collection of hashes, well suited to calculating the containment of your MAGs within the catalogue’s MAGs.

for mag in files:
    # The sourmash parameters are chosen to match those used within MGnify
    sketch = sourmash.MinHash(n=0, ksize=31, scaled=1000)
    
    # A fasta file may have multiple records in it. Add them all to the sourmash signature.
    for index, record in enumerate(SeqIO.parse(mag, 'fasta')):
        sketch.add_sequence(str(record.seq))
        
    # Save the sourmash sketch as a "signature" file
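    # (the signature takes the name of the last FASTA record in the file)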
    signature = sourmash.SourmashSignature(sketch, name=record.name)
    with open(pp(pp(mag).name).stem + '.sig', 'wt') as fp:
        sourmash.save_signatures([signature], fp)
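As a quick sanity check, you can load one of the saved signatures back and inspect it. A minimal sketch, assuming sourmash 4.x and that at least one .sig file was written:

# Reload the first saved signature and show its name and number of hashes
check_sig = sourmash.load_one_signature(glob.glob('*.sig')[0])
print(check_sig.name, len(check_sig.minhash.hashes))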

Fetch all of the catalogue IDs currently available on MGnify

To fetch the catalogue IDs from the MGnify API, use the following endpoint: https://www.ebi.ac.uk/metagenomics/api/v1/genome-catalogues.

catalogue_endpoint = "genome-catalogues"
with APISession("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    catalogues = map(lambda r: r.json, mgnify.iterate(catalogue_endpoint))
    catalogues = pd.json_normalize(catalogues)
catalogue_ids = list(catalogues['id'])
catalogue_ids

Submit a search job to the MGnify API

To submit a job to the MGnify API, use the following endpoint: https://www.ebi.ac.uk/metagenomics/api/v1/genomes-search/gather.
Data will be sent to the API, which is called “POST”ing data in the API world.
This part of the API is quite specialized and so is not a formal JSON:API; the requests Python package is therefore used to communicate with it.

endpoint = 'https://www.ebi.ac.uk/metagenomics/api/v1/genomes-search/gather'
# Create a list of file uploads, and attach them to the API request
signatures = [open(sig, 'rb') for sig in glob.glob('*.sig')]
sketch_uploads = [('file_uploaded', signature) for signature in signatures]

# Send the API request - it specifies which catalogue to search against and attaches all of the signature files.
submitted_job = requests.post(endpoint, data={'mag_catalogues': catalogue_ids}, files=sketch_uploads).json()


for signature in signatures:
    signature.close()  # tidy up open file pointers (a bare map() would be lazy and never run)

print(submitted_job)

Wait for the results to be ready

As you can see in the printed submitted_job above, a status_URL was returned in the response from submitting the job via the API. Since the job is in a queue, this status_URL must be polled until the job has completed.
Below is an example that waits 10 seconds, then checks every 2 seconds until ALL of the jobs are finished. The intervals can easily be changed by passing different values to time.sleep:

time.sleep(10)
job_done = False
while not job_done:
    print('Checking status...')
    # The status_URL is another API endpoint that's unique for the submitted search job
    query_result = None
    
    while not query_result:
        query_result = requests.get(submitted_job['data']['status_URL'])
        print('Still waiting for jobs to complete.')
        print('Will check again in 2 seconds...')
        time.sleep(2) 
        
    queries_status = {sig['job_id']: sig['status'] for sig in query_result.json()['data']['signatures']}
    job_done = all(map(lambda q: q == 'SUCCESS', queries_status.values()))
    
print('Job done!')

The query_result contains the results of the query.
The results can be visualised as JSON (try entering query_result.json()), or as a Pandas dataframe:

query_result_df = pd.json_normalize(query_result.json()['data']['signatures'])
query_result_df

Output: each signature is queried against each catalogue, one row per signature-catalogue pair.
Results can then be analysed according to your research goal, for example looking for representation in Host-Associated biome catalogues vs. Environmental ones.
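For example, a quick tally of the positive matches per catalogue; a minimal sketch using only columns already present in query_result_df:

# Count positive matches per catalogue (rows with no match have NaN in 'result.match')
query_result_df.dropna(subset=['result.match']).groupby('catalogue').size()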

Are any of our MAGs found in biomes other than the human gut?

matches = query_result_df.dropna(subset=['result.match'])
matches

Output: a DataFrame containing only the results with a positive match in a catalogue.

fig = plt.figure(figsize=(3, 3))
spec = fig.add_gridspec(1, 2)

ax01 = fig.add_subplot(spec[0:])
matches.catalogue.hist()
plt.xlabel("Catalogue")
ax01.grid(False)



fig.suptitle('Biomes matched by the MAGs');

Output: displays which biomes were matched by the MAGs during the search. The five signatures match various numbers of genomes in the human gut, cow rumen, and pig gut catalogues.
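To see exactly which MAG matched which catalogue, a cross-tabulation of the matches works well (a sketch):

# Rows: query filenames; columns: the catalogues they matched
pd.crosstab(matches['filename'], matches['catalogue'])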

What is the taxonomy of the MGnify MAGs which match the query?

Call the API for each genome to find its taxonomic lineage.

def get_taxonomy_of_mgnify_mag(match_row):
    mgyg_accession = match_row['result.match']
    with APISession("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
        genome_document = mgnify.get('genomes', mgyg_accession)
        return genome_document.resource.taxon_lineage

Create a new column in the matches table.

matches['best_match_taxonomy'] = matches.apply(get_taxonomy_of_mgnify_mag, axis=1)
matches

Output: The matches DataFrame has a new column that contains the corresponding taxon-lineage information.

Summarise the results more verbosely:

for row, match in matches.iterrows():
    print(f"The MAG ({match['filename']}) matches {match['result.match']} which has taxonomy {match['best_match_taxonomy']}")

Which of our MAGs are completely novel (i.e. in no MGnify catalogue)?

One way to check this is to group all of the search results by filename (i.e. collecting each MAG’s queries against all catalogues) and check whether the sum of all matches is 0:

query_result_df.groupby('filename').apply(lambda query: query['result.matches'].sum() == 0)

If you’re using the example dataset, none of the MAGs are completely novel.
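To turn that boolean series into an explicit list of the novel MAGs, a short sketch (novelty and novel_mags are illustrative names):

# Keep only the filenames whose queries matched nothing in any catalogue
novelty = query_result_df.groupby('filename').apply(lambda query: query['result.matches'].sum() == 0)
novel_mags = novelty[novelty].index.tolist()
print(novel_mags or 'No completely novel MAGs in this set')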